f-Divergence constrained policy improvement
Abstract
To ensure stability of learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of f-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate of f and includes the KL solution as a special case. Within the class of f-divergences, we further focus on a one-parameter family of α-divergences to study the effects of the choice of divergence on policy improvement. Previously known as well as new policy updates emerge for different values of α. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen f-divergence. Interestingly, mean-squared Bellman error minimization is closely related to policy evaluation with the Pearson χ²-divergence penalty, while the KL divergence results in the soft-max policy update and a log-sum-exp critic. We carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of using different divergence functions on a multi-armed bandit problem and on standard reinforcement learning problems.
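As a rough illustration of the KL special case mentioned in the abstract, the sketch below solves a penalized policy improvement step on a small multi-armed bandit. It is not the authors' code: the penalty form (maximize expected reward minus η times KL(π‖q)), the function names `kl_policy_update` and `objective`, the temperature `eta`, and the reward values are all assumptions chosen for the example. Under those assumptions the maximizer is the soft-max update π(a) ∝ q(a) exp(r(a)/η), and the optimal objective value is the log-sum-exp η log Σ q(a) exp(r(a)/η).

```python
import math


def kl_policy_update(q, rewards, eta):
    """Soft-max update: maximizes sum_a pi(a) r(a) - eta * KL(pi || q).

    q       -- old policy (probabilities over arms), hypothetical input
    rewards -- reward per arm, hypothetical input
    eta     -- penalty weight (temperature), must be positive
    """
    m = max(rewards)  # shift rewards for numerical stability
    weights = [qa * math.exp((r - m) / eta) for qa, r in zip(q, rewards)]
    z = sum(weights)
    pi = [w / z for w in weights]
    # Optimal value of the penalized objective: a log-sum-exp of the rewards.
    value = m + eta * math.log(z)
    return pi, value


def objective(pi, q, rewards, eta):
    """Expected reward minus the KL penalty, evaluated for any policy pi."""
    expected = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / qa) for p, qa in zip(pi, q) if p > 0)
    return expected - eta * kl


# Illustrative 4-armed bandit with a uniform old policy.
q = [0.25, 0.25, 0.25, 0.25]
rewards = [1.0, 0.5, 0.0, -0.5]
pi, value = kl_policy_update(q, rewards, eta=0.5)
```

Evaluating `objective(pi, q, rewards, 0.5)` reproduces `value`, which is one way to check that the log-sum-exp expression is indeed the optimum of the penalized problem. Smaller values of `eta` push the new policy toward the greedy arm, while larger values keep it close to `q`.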
Journal title:
- CoRR
Volume abs/1801.00056
Publication date: 2017